We will analyse alternative splicing in the same dataset with the DEXSeq R package (see Love, Huber, and Anders (2014)).

1 Overview

The flow is very similar to the DESeq2 tutorial - however all analyses are done on exon level.

2 Data used in this case study

Loss of function of myosin chaperones triggers Hsf1-mediated transcriptional response in skeletal muscle cells
Christelle Etard, Olivier Armant, Urmas Roostalu, Victor Gourain, Marco Ferg and Uwe Strähle
Genome Biology 2015 16:267 https://doi.org/10.1186/s13059-015-0825-8

RNA-seq_Strahle_Lab_0005AS.<SequencingID>.USERvgourain.R.ReadsPerGene.out.tab
Hpf wt unc45b
24 DCD001548SQ DCD001560SQ
DCD001559SQ DCD001554SQ
48 DCD001546SQ DCD001564SQ
DCD001558SQ DCD001555SQ
72 DCD001547SQ DCD001565SQ
DCD001545SQ DCD001551SQ

All the libraries were unstranded paired-ended sequenced on Illumina HiSeq 2000 producing 50bp long reads.

3 Data preparation

  • Order paired-end reads by name and convert the data from BAM to SAM format by using samtools.
$ samtools sort -n RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.genome.bam | samtools view -h - > RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.genome.sorted.sam
$ samtools sort -n RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.genome.bam | samtools view -h - > RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.genome.sorted.sam
$ samtools sort -n RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.genome.bam | samtools view -h - > RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.genome.sorted.sam
$ samtools sort -n RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.genome.bam | samtools view -h - > RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.genome.sorted.sam
  • Annotation conversion into counting exons from GTF to GFF format.
$ python dexseq_prepare_annotation.py Danio_rerio.GRCz10.87.gtf Danio_rerio.GRCz10.87.gff
  • Counting reads within exons.
$ python dexseq_count.py -p yes -s no <ANNOTATION.GFF> <INPUT.BAM> <OUTPUT.tab>
  Options:
    -s, whether the data is from a strand-specific assay 
    -p, whether the data is paired-end
    <ANNOTATION.GFF>, pre-processed annotation file using dexseq_prepare_annotation.py python script
    <INPUT.BAM>, mapped reads

$ python dexseq_count.py -p yes -s no Danio_rerio.GRCz10.87.gff RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.genome.sorted.sam RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.genome.sorted-DEXseq.tab
$ python dexseq_count.py -p yes -s no Danio_rerio.GRCz10.87.gff RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.genome.sorted.sam RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.genome.sorted-DEXseq.tab
$ python dexseq_count.py -p yes -s no Danio_rerio.GRCz10.87.gff RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.genome.sorted.sam RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.genome.sorted-DEXseq.tab
$ python dexseq_count.py -p yes -s no Danio_rerio.GRCz10.87.gff RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.genome.sorted.sam RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.genome.sorted-DEXseq.tab

Download scripts from the DEXseq source package DEXSeq_1.24.1.tar.gz.

4 Import data into DEXSeq

The remainder of the analysis is now done in R. We will use the output of the python scripts from above.

# import the data
directory <- "../data/provided_files/data_DEXseq/"
# the path of the count files
sampleFiles <- grep("*.tab", list.files(directory), value = TRUE)
countFiles = paste(directory, sampleFiles, sep = "")
# Flattened GFF file
flattenedFile = paste(directory, "Danio_rerio.GRCz10.87.gff", sep = "")

First, we need to prepare a sample table like in DESeq2. This table should contain one row for each library, and columns for all relevant information such as name of the file with the read counts, experimental conditions, technical information and further covariates. Here, we simply put in the file paths and conditions.

sampleTable = data.frame(row.names = c("wt.1", "unc45b.1", "wt.2", "unc45b.2"), 
    condition = c("wt", "unc45b", "wt", "unc45b"))
# show the sampleTable
sampleTable
##          condition
## wt.1            wt
## unc45b.1    unc45b
## wt.2            wt
## unc45b.2    unc45b

Then we built a DEXSeqDataSet object using the data:

suppressPackageStartupMessages(library("DEXSeq"))

dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData = sampleTable, design = ~sample + 
    exon + condition:exon, flattenedfile = flattenedFile)

The function takes four arguments:

  • A vector with names of count files (files that have been generated with the dexseq_count.py script):

  • The sample table, with one row for each of the files listed in the first argument.

  • A formula of the form “ sample + exon + condition:exon” that specifies the contrast with of a variable from the sample table columns and the ‘exon’ variable.

    • Using this formula, we are interested in differences in exon usage due to the ‘condition’ variable changes.
  • A file name of the flattened GFF file (generated with dexseq_prepare_annotation.py)

4.1 See the imported data

The DEXSeqDataSet class is derived from the DESeqDataSet. As such, it contains the usual accessor functions for the column data, row data, and some specific ones. The core data in an DEXSeqDataSet object are the counts per exon. Each row of the DEXSeqDataSet contains in each column the count data from a given exon (’this’) as well as the count data from the sum of the other exons belonging to the same gene (’others’). This annotation, as well as all the information regarding each column of the DEXSeqDataSet, is specified in the colData.

colData(dxd)
## DataFrame with 8 rows and 3 columns
##     sample condition     exon
##   <factor>  <factor> <factor>
## 1     wt.1        wt     this
## 2 unc45b.1    unc45b     this
## 3     wt.2        wt     this
## 4 unc45b.2    unc45b     this
## 5     wt.1        wt   others
## 6 unc45b.1    unc45b   others
## 7     wt.2        wt   others
## 8 unc45b.2    unc45b   others

We can access the first 5 rows from the count data by doing:

head(counts(dxd), 5)
##                         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## ENSDARG00000000001:E001    5    2    9    9  203   88  108  181
## ENSDARG00000000001:E002   17   12   16   25  191   78  101  165
## ENSDARG00000000001:E003   91   33   36   67  117   57   81  123
## ENSDARG00000000001:E004   11    8    6   11  197   82  111  179
## ENSDARG00000000001:E005   20   10   13   20  188   80  104  170

Notice that the number of columns is 8, the first four (we have four samples) corresponding to the number of reads mapping to out exonic regions and the last four correspond to the sum of the counts mapping to the rest of the exons from the same gene on each sample.


We can also access only the first five rows from the count belonging to the exonic regions (’this’) (without showing the sum of counts from the rest of the exons from the same gene) by doing:

head(featureCounts(dxd), 5)
##                         wt.1 unc45b.1 wt.2 unc45b.2
## ENSDARG00000000001:E001    5        2    9        9
## ENSDARG00000000001:E002   17       12   16       25
## ENSDARG00000000001:E003   91       33   36       67
## ENSDARG00000000001:E004   11        8    6       11
## ENSDARG00000000001:E005   20       10   13       20

5 Normalisation and dispersion estimation

Different samples might be sequenced with different depths. In order to adjust for such coverage biases, we estimate size factors, which measure relative sequencing depth. DEXSeq uses the same method as DESeq and DESeq2, which is provided in the function estimateSizeFactors.

# estimate size factors
dxd <- estimateSizeFactors(dxd)
sizeFactors(dxd)
# export normalised data
normCountTable <- data.frame(featureCounts(dxd, normalized = TRUE))
write.table(normCountTable, file = "../data/results/results_DEXseq/DEXSeq-wt_vs_unc45b-normalised-counts.tab", 
    quote = FALSE, sep = "\t")

6 dispersion estimation

To estimate the dispersion estimates, DEXSeq uses the approach of the package DESeq2. Internally, the functions from DESeq2 are called, adapting the parameters of the functions for the specific case of the DEXSeq model. Briefly, per-exon dispersions are calculated using a Cox-Reid adjusted profile likelihood estimation, then a dispersion-mean relation is fitted to this individual dispersion values and finally, the fitted values are taken as a prior in order to shrink the per-exon estimates towards the fitted values.

We have run this code for you - DO NOT RUN AT THIS TUTORIAL

dxd <- estimateDispersions(dxd)

Instead we’ll upload the DEXSeqDataSet with this already performed and plot the dispersion estimates:

load("DEXSeq.Rdata")
plotDispEsts(dxd)  #plot per-exon dispersion estimates versus the mean normalised count, the resulting fitted values and the a posteriori (shrinked) dispersion estimates

7 Testing for differential exon usage

Having the dispersion estimates and the size factors, we can now test for differential exon usage. For each gene, DEXSeq fits a generalized linear model with the formula:
∼ sample + exon + condition:exon
and compare it to the smaller model (the null model)
∼ sample + exon

The functions - we have run this code for you - DO NOT RUN AT THIS TUTORIAL

dxd <- testForDEU(dxd)

The resulting DEXSeqDataSet object contains slots with information regarding the test. For some uses, we may also want to estimate relative exon usage fold changes. To this end, we call estimateExonFoldChanges .

dxd = estimateExonFoldChanges(dxd, fitExpToVar = "condition")

8 Exploring results

So far in the pipeline, the intermediate and final results have been stored in the meta data of a DEXSeqDataSet object, they can be accessed using the function mcol. In order to summarize the results without showing the values from intermediate steps, we call the function DEXSeqResults.

# get results
dxr1 <- DEXSeqResults(dxd)
head(dxr1)
# export result table
write.table(dxr1, "../data/results/results_DEXseq/DEXseqHTML-wt_vs_unc45b-results.tab")

From this object, we can ask how many exonic regions are significant with a false discovery rate of 10%:

table(dxr1$padj < 0.1)
## 
## FALSE  TRUE 
## 78995   630

We may also ask how many genes are affected

table(tapply(dxr1$padj < 0.1, dxr1$groupID, any))
## 
## FALSE  TRUE 
##  1499   398

To see how the power to detect differential exon usage depends on the number of reads that map to an exon, a so-called MA plot is useful, which plots the logarithm of fold change versus average normalised count per exon and marks by red colour the exons which are considered significant; here, the exons with an adjusted p values of less than 0.1 for example

# MA plot witg of exons with an adjusted p values of less than 0.1
plotMA(dxr1, cex = 0.8)

8.1 Plot regulated exons

To generate an easily browsable, detailed overview over all analysis results, the package provides an HTML report generator, implemented in the function DEXSeqHTML. This function uses the package hwriter (Pau and Huber (n.d.)) to create a result table with links to plots for the significant results, allowing a more detailed exploration of the results.

To consider time - look at the path described and open the html to have a look

DEXSeqHTML(dxr1, FDR = 0.1, color = c("#FF000080", "#0000FF80"), path = "./DEXseqHTML-wt_vs_unc45b")

Additionally, exons can be plotted individually:

# plot regulated exons
plotDEXSeq(dxr1, "ENSDARG00000003081", legend = TRUE, cex.axis = 1.2, cex = 1.3, 
    lwd = 2)

9 Sessioninfo

References

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8.

Pau, Gregoire, and Wolfgang Huber. n.d. “The hwriter Package Composing HTML documents with R objects.” https://journal.r-project.org/archive/2009-1/RJournal{\_}2009-1{\_}Pau+Huber.pdf.